Big data analysis and visualisation
Session 1: Getting your development environment in order
J.M.T. Roos
Last updated: 2017-05-08 11:55:08
Learning by doing
- Coding is not as difficult as people think
- My approach to teaching people to code is
- Have them type (and click on) things they don’t fully understand
- (Because they are curious) they then ask “Hmm…why did that work?”
- I then explain what’s going on
- That’s the approach we’ll take today
Doing before understanding
- First, you will
- Set up a local workspace
- Configure your source control software
- Create an online repository for your work
- Set up your local project files
- Track them using version control
- Make edits, commit changes, push changes to the online repository, etc.
- Ask questions as we go, and I will explain what all this means
Sticky notes
- Place a sticky note on your computer to signal:
- Please try to help each other first, you’ll learn more that way…
- At the end of each session (e.g., before each break), write on the cards and stick them on the door
Set up your local working directory
Do this once for each computer you work on
- Create a top-level directory where all of your code will go
- Each project will live in its own directory underneath this one
- I call mine git_workspace; I will use this name as a placeholder throughout these slides
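If you prefer the terminal to a file browser, creating the directory might look like the sketch below. The location under your home directory is an assumption; any place you can find again will do.

```shell
# Assumed location: a folder named git_workspace in your home directory.
# Set WORKSPACE beforehand if you want it somewhere else.
WORKSPACE="${WORKSPACE:-$HOME/git_workspace}"

# -p creates the directory only if it does not already exist.
mkdir -p "$WORKSPACE"

# Confirm that it is there.
ls -d "$WORKSPACE"
```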
Tell git who you are
- Start a new terminal (OS X) or bash (Linux) or git-bash (Windows) session
- Execute the following commands, substituting your information
- Note: Use the email address associated with your GitHub account
$ git config --global user.name "John Doe"
$ git config --global user.email "johndoe@example.com"
- OS X users might need to issue the following commands as well
$ git credential-osxkeychain
$ git config --global credential.helper osxkeychain
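To check what git has recorded, you can read the settings back. This is a minimal sketch; the variable names `git_name` and `git_email` are only for illustration.

```shell
# Read back the identity git will attach to your commits.
# Each lookup yields an empty string if the setting has not been made yet.
git_name="$(git config --global --get user.name || true)"
git_email="$(git config --global --get user.email || true)"

echo "user.name:  ${git_name:-<not set>}"
echo "user.email: ${git_email:-<not set>}"
```

If either line prints `<not set>`, run the corresponding `git config --global` command again.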
Create a new repository on GitHub
Do this once for each project you work on
- Log into GitHub (https://github.com)
- Click on New repository
- Enter the name session_1 and click Create repository
Get the URL for your new repository
- Click on the HTTPS button under Quick Setup and copy the URL to your computer’s clipboard
- I use https://github.com/jasonmtroos/session_1.git in subsequent examples
- Note: Your repository is empty in the sense that you have not added any files to it yet
Create a new RStudio project
- Open RStudio
- Select File > New Project…
- Click on Version Control and then on Git
- Paste the URL you copied from GitHub into the Repository URL field
- Hit tab, then ensure Project directory name is set to session_1
- Under Create project as subdirectory of: click Browse… and then navigate to your git_workspace directory
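Creating a project from version control amounts to cloning the repository into your workspace. The sketch below shows roughly the equivalent git commands. So that it runs anywhere, it uses a scratch directory and a local empty repository as a stand-in for GitHub; both paths are assumptions made for illustration. With a real repository you would pass the HTTPS URL you copied instead.

```shell
# Work in a scratch directory so nothing else on your machine is touched.
scratch="$(mktemp -d)"
mkdir -p "$scratch/git_workspace"

# Stand-in for the empty GitHub repository (hypothetical local path).
git init --quiet --bare "$scratch/session_1.git"

# Clone into git_workspace; the last argument is the project directory name.
# git warns that the repository is empty, which is expected at this point.
cd "$scratch/git_workspace"
git clone --quiet "$scratch/session_1.git" session_1

# The clone remembers where it came from under the name "origin".
git -C session_1 remote get-url origin
```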
Create a new R Markdown file
Do this one or more times for each project you work on
- Select File > New File > R Markdown…
- Set the Title to My First Markdown
- Enter your name as the Author and click OK
Save your new R Markdown file
- Select File > Save
- Name the file my_first_markdown.Rmd and make sure it gets saved into git_workspace/session_1 (this should be the default)
Tell git to keep track of your new file(s)
- Click on the Git tab in the upper right panel
- Check each of the boxes next to the three files listed
- Click on the Commit button
- In the window that pops up, type "Initial add" in the Commit message window
Commit your changes
- Click Commit
- RStudio will now interact with git behind the scenes
- When they are done, RStudio will open a small window showing what git had to say
[master (root-commit) 2757d93] Initial add
3 files changed, 47 insertions(+)
create mode 100644 .gitignore
create mode 100644 my_first_markdown.Rmd
create mode 100644 session_1.Rproj
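Behind the scenes, ticking the boxes and clicking Commit corresponds roughly to `git add` followed by `git commit`. The sketch below reproduces that in a scratch repository; the empty files and the John Doe identity are placeholders, not your real project.

```shell
# Scratch repository standing in for git_workspace/session_1.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Placeholder identity, set for this scratch repository only.
git config user.name "John Doe"
git config user.email "johndoe@example.com"

# Empty stand-ins for the three files RStudio listed.
touch .gitignore session_1.Rproj my_first_markdown.Rmd

# Ticking the checkboxes corresponds to staging the files...
git add .gitignore session_1.Rproj my_first_markdown.Rmd

# ...and the Commit button records the staged snapshot with your message.
git commit --quiet -m "Initial add"

# Show the commit that was just recorded.
git log --oneline
```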
Push your changes to GitHub
- Click on the Push button in the top right corner
- If you receive the following error…
error: unable to read askpass response from 'rpostback-askpass'
fatal: could not read Username for 'https://github.com': Device not configured
- …click on the Gear icon > Shell… and type
- …then enter your GitHub username and password when prompted
To https://github.com/jasonmtroos/session_1
f5d8fc0..dbea7bb master -> master
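For reference, a push from the shell generally takes the shape sketched below. To stay runnable without a network connection or password, the sketch pushes to a local empty repository standing in for GitHub; that stand-in, the placeholder identity, and the choice of `-u origin HEAD` are all assumptions made for illustration.

```shell
# Scratch setup: a local bare repository stands in for GitHub.
scratch="$(mktemp -d)"
git init --quiet --bare "$scratch/session_1.git"
git clone --quiet "$scratch/session_1.git" "$scratch/session_1"
cd "$scratch/session_1"
git config user.name "John Doe"
git config user.email "johndoe@example.com"

# One commit to push.
touch my_first_markdown.Rmd
git add my_first_markdown.Rmd
git commit --quiet -m "Initial add"

# Push the current branch to origin. The -u flag records origin as the
# branch's upstream, so later pushes need no extra arguments.
git push --quiet -u origin HEAD

# List the branches the remote now holds.
git ls-remote --heads origin
```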
Compile your R Markdown file
- Click on the Knit button
- Your file will be used to generate a new HTML document, which will be displayed in a special document viewer window
Make a change to your R Markdown file
- Change the boilerplate text in the R Markdown file you just created by inserting the line "This is my first markdown file." above the heading ## R Markdown
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This is my first markdown file.
## R Markdown
This is an R Markdown document. Markdown...
- Save your changes by selecting File > Save
- Click on Knit again to see how your file has changed
Commit your changes and push them to GitHub
Do this whenever you make changes to your file that you want to keep
- Click on the Git tab
- Check the box next to my_first_markdown.Rmd (ignore the file ending in .html for now)
- Click on Commit
- Enter the commit message "Added line to introduction"
- Click on Commit (NB: not the same button as in the previous step)
- Click on Push
- Close the dialog showing the output from git
View your file’s history
Do this when trying to figure out where and when you introduced a mistake into your file
- Highlight my_first_markdown.Rmd and then click on the History button
- You will see a summary of all versions of this file that you have previously committed to your local git repository
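The History button shows roughly what `git log` reports for a single file. The sketch below builds a scratch repository with two commits and then lists them; the file contents and identity are placeholders.

```shell
# Scratch repository with two versions of the same file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "John Doe"
git config user.email "johndoe@example.com"

echo "## R Markdown" > my_first_markdown.Rmd
git add my_first_markdown.Rmd
git commit --quiet -m "Initial add"

echo "This is my first markdown file." >> my_first_markdown.Rmd
git commit --quiet -am "Added line to introduction"

# One line per commit that touched the file, newest first.
git --no-pager log --oneline -- my_first_markdown.Rmd

# Adding -p also shows the exact lines each commit changed.
git --no-pager log -p -- my_first_markdown.Rmd
```

The `-p` output is the part that helps when hunting for the commit that introduced a mistake.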
View your remote repository
- Go to https://github.com/jasonmtroos/session_1, substituting your GitHub username for mine
- Here you will see the current state of your repository after your last Push
- If you click on my_first_markdown.Rmd and then the History button you will see a similar summary to the one provided by RStudio
What you’ve done so far
- Set up your workspace
- Created a git repository on GitHub and cloned it into your local workspace
- Created an RStudio project and added an R Markdown file to it
- Compiled your R Markdown file into an HTML document
- Committed your edits to git and pushed them to GitHub
- Viewed your commit history
What’s next
- Verify that you have installed the required R packages
- Edit your R Markdown file so that it does something useful
- Get used to keeping track of files using git
- Understand how git, R, RStudio, and R Markdown can help you become a more organized and efficient researcher
Installing packages
- You should have already installed a number of packages as part of the pre-work.
- Check to see if the following code generates an error:
- If you get an error, then run this now:
install.packages("tidyverse", dependencies = TRUE)
- And re-read the pre-work assignment carefully
Edit your R Markdown file
- Delete everything that comes after the following lines of code
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
[Deleted from this point forward]
- We will now do something useful with this file…
- Edit your file to include the code in the following slides
Download data
- Add this code to your R Markdown file
Download sleep study data
```{r}
library(readr)
# The URL must be one unbroken string, with no spaces or line breaks
sleep <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/lme4/sleepstudy.csv")
head(sleep)
```
Commit your changes
- Click on the Git tab
- Check the box next to my_first_markdown.Rmd (ignore the file ending in .html for now)
- Click on Commit
- Enter the commit message "Replaced boilerplate with code to download sleep study data"
- Click on Commit
- Do not click on Push yet
Committing and pushing
- The change you just committed is reflected in your local git repository and its history (viewable via RStudio)
- This change will not be reflected on GitHub until you push those changes
- In general, you make many changes locally (committing them as you go) and then push them to GitHub less frequently
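The gap between committing and pushing can be made visible by counting the commits that exist locally but not yet on the remote. The sketch below is one way to do that, again with a local stand-in for GitHub and a placeholder identity; `ahead_before` and `ahead_after` are illustrative variable names.

```shell
# Scratch setup: a local bare repository stands in for GitHub.
scratch="$(mktemp -d)"
git init --quiet --bare "$scratch/session_1.git"
git clone --quiet "$scratch/session_1.git" "$scratch/session_1"
cd "$scratch/session_1"
git config user.name "John Doe"
git config user.email "johndoe@example.com"

# First commit, pushed straight away.
echo "first draft" > my_first_markdown.Rmd
git add my_first_markdown.Rmd
git commit --quiet -m "Initial add"
git push --quiet -u origin HEAD

# Two further commits that stay local for now.
echo "download data" >> my_first_markdown.Rmd
git commit --quiet -am "Replaced boilerplate with code to download sleep study data"
echo "plot data" >> my_first_markdown.Rmd
git commit --quiet -am "Added code to create plot by individual"

# Count commits present locally but missing from the remote branch.
ahead_before="$(git rev-list --count '@{u}..HEAD')"
echo "Commits waiting to be pushed: $ahead_before"

# After a push, nothing is left waiting.
git push --quiet
ahead_after="$(git rev-list --count '@{u}..HEAD')"
echo "Commits waiting to be pushed: $ahead_after"
```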
Plot the data
- Add this code to your R Markdown file
Plot the data by subject
```{r}
library(ggplot2)
ggplot(sleep, aes(x = Days, y = Reaction)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(breaks = seq(0, 8, by = 2)) +
  facet_wrap(~ Subject)
```
Group and summarise
- Next, add this code to your R Markdown file
Summary statistics by subject
```{r}
library(dplyr)
sleep %>%
  group_by(Subject) %>%
  summarise(mean(Reaction))
```
Mean-center reaction times and plot on a single axis
```{r}
sleep %>%
  group_by(Subject) %>%
  mutate(mean_centered_reaction = Reaction - mean(Reaction)) %>%
  ggplot(aes(x = Days, y = mean_centered_reaction,
             colour = factor(Subject))) +
  geom_smooth(show.legend = FALSE, se = FALSE) +
  scale_x_continuous(breaks = seq(0, 8, by = 2))
```
Commit your changes and push them to GitHub
- Click on the Git tab
- Check the box next to my_first_markdown.Rmd (ignore the file ending in .html for now)
- Click on Commit
- Enter the commit message "Added code to create plot by individual"
- Click on Commit
- Click on Push
- Close the dialog showing the output from git
Why are we doing all of this?
- Workspace organization and git
- Automation and reproducibility via R Markdown
Because without these practices, you cannot expect to successfully manage work that involves large and/or disparate data sets
Workspace ideas
- Each project has its own folder under git_workspace
- There are files you are allowed to change, and files you can never change
- Raw (i.e., original) data files can never be changed
- If a file can be changed, it must live under git_workspace/project with changes tracked by git or some other version control software
- Data files can also live under git_workspace/project/data-raw provided you never alter them
- Altered data files should be placed in a different directory from the originals (e.g., in git_workspace/project/data)
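One possible way to set up this layout from the shell is sketched below. The scratch location, the file names, and the tiny CSV are made up for illustration. Removing write permission from raw files is an optional extra safeguard, not something these slides require.

```shell
# Scratch stand-in for git_workspace/project.
project="$(mktemp -d)/project"
mkdir -p "$project/data-raw" "$project/data"

# A tiny made-up raw file, for illustration only.
printf 'id,value\n1,10\n2,20\n' > "$project/data-raw/example.csv"

# Remove write permission so the original cannot be edited by accident.
chmod a-w "$project/data-raw/example.csv"

# Derived files go in data/, never back into data-raw/.
# Here the derived file is simply the raw file without its header row.
tail -n +2 "$project/data-raw/example.csv" > "$project/data/example_no_header.csv"

ls -l "$project/data-raw" "$project/data"
```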
Reproducibility and automation ideas
- Your data management code and analysis should be transparently documented and controlled by a repeatable, automated process (e.g., by knitting an R Markdown file)
- You (or someone else) should be able to reproduce your analysis exactly by
- Cloning the git repository
- Opening RStudio
- Knitting your R Markdown file
- …or… by reading the HTML output of your R Markdown file and entering commands directly into R exactly as they appear in the HTML file
More about git
- git is a command-line tool, but RStudio (and a few other tools) offers a GUI for performing simple tasks
- My goal for this course is that everything we do with git can be performed via RStudio’s interface
- There is much, much, much more that git can do:
- Merging changes made by multiple collaborators
- Creating different versions of the same code for different purposes
- Committing changes to experimental code without affecting the main code base
- Creating a snapshot that records what your code looked like at the point you submit a manuscript
- If you want to know more about git, check out GitHub’s list of resources for learning git and GitHub
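To give a flavour of two of these features, the sketch below tags a snapshot and tries out an experimental change on a separate branch, all in a scratch repository. The file name `analysis.R`, the tag name `submitted-manuscript`, and the branch name `experiment` are hypothetical.

```shell
# Scratch repository with a single committed file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "John Doe"
git config user.email "johndoe@example.com"

echo "analysis v1" > analysis.R
git add analysis.R
git commit --quiet -m "Initial analysis"

# Remember the name of the main branch (master or main, depending on setup).
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# Snapshot: label the code exactly as it was when the manuscript went out.
git tag -a submitted-manuscript -m "Code as submitted"

# Experimental work happens on its own branch.
git checkout --quiet -b experiment
echo "risky idea" >> analysis.R
git commit --quiet -am "Try a risky idea"

# Back on the main branch, the file is unaffected by the experiment.
git checkout --quiet "$main_branch"
before_merge="$(cat analysis.R)"
echo "$before_merge"

# If the experiment works out, merge it into the main branch.
git merge --quiet experiment
cat analysis.R
```

The tag keeps pointing at the submitted version no matter how many commits follow.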
More about R Markdown
- Based on the idea of literate programming
- Simple rules for formatting text as HTML (or PDF via latex)
- Code can be inserted inline with other text using special delimiters
- When compiling the markdown file, the code is executed and results are inserted into the final HTML file
## [1] 52
Like so.
R Markdown example
This is the code I used to generate the preceding slide.
More about R Markdown
=====================
* Based on the idea of *literate programming*
* Simple rules for formatting text as HTML (or PDF via latex)
* Code can be inserted inline with other text using special
delimiters
* When compiling the markdown file, the code is executed and
results are inserted into the final HTML file
```{r}
5 * 10 + 2
```
Like so.
More about R Markdown
- This is relatively easy to learn, and you should do so
- In particular, you can control how your code is processed, the size of graphics, etc. via code that looks like this:
```{r fig.width=5, fig.height=4}
- Code like fig.width = 5 is processed by a package called knitr; look it up if you want to learn more about customizing output
- RStudio has an entire site dedicated to the R Markdown file format, which you should visit and read this week
What you need to do this week
- Ensure your R environment is in order (see next slide)
- Get familiar with R Markdown (mandatory) and git (optional) – see previous slides for links
- Follow the R tutorials I developed for my Marketing Analytics MSc course
- Assignment 1 – learning R (mandatory if you are lost right now, otherwise optional)
- Assignment 2, sections 1–2 – linear and logistic regression (mandatory if you don’t know what y ~ x means in R, otherwise optional)
- Assignment 3, section 2 – ggplot2 (mandatory)
- Assignment 4 – more ggplot2 and dplyr (mandatory)
- Email me with questions: roos@rsm.nl
Get your R environment in order
install.packages("devtools", dependencies = TRUE)
install.packages("tidyverse", dependencies = TRUE)
install.packages("Hmisc", dependencies = TRUE)
install.packages("tm", dependencies = TRUE)
install.packages("twitteR", dependencies = TRUE)
install.packages("topicmodels", dependencies = TRUE)
- Windows users will need to install Rtools, which provides the necessary toolchain for compiling R packages from source code
A few things in R you should know
- Not covered in my tutorials, so let me know if you need resources covering these concepts
for (i in 1:10) {
  if (i == 8) {
    print(paste("i =", i))
  }
}
## [1] "i = 8"
i <- 1
while (TRUE) {
  i <- i + 1
  if (i == 8) {
    print(paste("i =", i))
    break
  }
}
## [1] "i = 8"
my_func <- function(x, y) {
  x + y
}
my_func(2, 3)
## [1] 5